A Novel Method for Bilingual Web Page Acquisition from Search Engine Web Records

نویسندگان

  • Yanhui Feng
  • Yu Hong
  • Zhenxiang Yan
  • Jian-Min Yao
  • Qiaoming Zhu
چکیده

A new approach has been developed for acquiring bilingual web pages from the result pages of search engines, which is composed of two challenging tasks. The first task is to detect web records embedded in the result pages automatically via a clustering method of a sample page. Identifying these useful records through the clustering method allows the generation of highly effective features for the next task which is high-quality bilingual web page acquisition. The task of high-quality bilingual web page acquisition is a classification problem. One advantage of our approach is that it is search engine and domain independent. The test is based on 2516 records extracted from six search engines automatically and annotated manually, which gets a high precision of 81.3% and a recall of 94.93%. The experimental results indicate that our approach is very effective.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Hybrid Method for Web Pages Ranking in Search Engines

There are many algorithms for optimizing the search engine results, ranking takes place according to one or more parameters such as; Backward Links, Forward Links, Content, click through rate and etc. The quality and performance of these algorithms depend on the listed parameters. The ranking is one of the most important components of the search engine that represents the degree of the vitality...

متن کامل

بررسی واکنش موتورهای کاوش وب به پیشینه‌های فرادا‌ده‌ای مبتنی برروش ترکیبی داده‌های خرد و روش داده‌های پیوندی

The purpose of this research was to find out the reaction of Web Search Engines to Metadata records created based on the combined method of Rich Snippets and Linked Data. 200 metadata records in two groups (100 records as the control group with the normal structure and, 100 records created based on microdata and implemented in RDF/XML as experimental group) extracted from the information gatewa...

متن کامل

Automatically Extracting Subsequent Response Pages from Web Search Sources

Usually, when Web search sources such as search engines and deep Websites retrieve too many result records for a given query, they split them among several pages with, say, ten or twenty records on each page and return only the page that has the top ranked records. This page usually provides one or more hyperlinks or buttons pointing to one or more of the remaining response pages (called subseq...

متن کامل

Integrating Cross-Lingually Relevant News Articles and Monolingual Web Documents in Bilingual Lexicon Acquisition

In the framework of bilingual lexicon acquisition from cross-lingually relevant news articles on the Web, it is relatively harder to reliably estimate bilingual term correspondences for low frequency terms. Considering such a situation, this paper proposes to complementarily use much larger monolingual Web documents collected by search engines, as a resource for reliably re-estimating bilingual...

متن کامل

Using the Web as a Bilingual Dictionary

We present a system for extracting an English translation of a given Japanese technical term by collecting and scoring translation candidates from the web. We first show that there are a lot of partially bilingual documents in the web that could be useful for term translation, discovered by using a commercial technical term dictionary and an Internet search engine. We then present an algorithm ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010